Introduction to Distributional Regression

Lecture 2

Gillian Heller

NHMRC Clinical Trials Centre, University of Sydney

gillian.heller@sydney.edu.au

The GAMLSS model

The GAMLSS model

\begin{align*} y_{i}&\sim \mathcal{D}\left(\mu_i,\sigma_i,\nu_i,\tau_i\right) \\ g_1(\mu_i)&=\beta_0^{\mu}+s_1^{\mu}(x_{i1})+\ldots+s_p^{\mu}( x_{ip})\\ g_2(\sigma_i)&=\beta_0^{\sigma}+s_1^{\sigma}(x_{i1})+\ldots+s_p^{\sigma}( x_{ip})\\ g_3(\nu_i)&=\beta_0^{\nu}+s_1^{\nu}(x_{i1})+\ldots+s_p^{\nu}( x_{ip})\\ g_4(\tau_i)&=\beta_0^{\tau}+s_1^{\tau}(x_{i1})+\ldots+s_p^{\tau}( x_{ip}) \end{align*} OR

\begin{align*} y_{i}&\sim \mathcal{D}\left(\theta_{i1},\ldots,\theta_{iK}\right) \\ g_k(\theta_{ik})&=\beta_0^{k}+s_1^{k}(x_{i1})+\ldots+s_p^{k}( x_{ip})\quad\text{for }k=1,\ldots,K \end{align*}

Elements of the GAMLSS model

\begin{align*} y_{i}&\sim \textcolor{red}{\mathcal{D}\left(\theta_{i1},\ldots,\theta_{iK}\right)} \\ g_k(\theta_{ik})&=\beta_0^{k}+s_1^{k}(x_{i1})+\ldots+s_p^{k}( x_{ip}) \end{align*}

  • Response distribution \textcolor{red}{\mathcal{D}(\theta_1,\ldots,\theta_K)}

Elements of the GAMLSS model

\begin{align*} y_{i}&\sim \textcolor{red}{\mathcal{D}\left(\theta_{i1},\ldots,\theta_{iK}\right)} \\ g_k(\theta_{ik})&=\beta_0^{k}+\textcolor{blue}{s_1^{k}(x_{i1})+\ldots+s_p^{k}( x_{ip})} \end{align*}

  • Response distribution \textcolor{red}{\mathcal{D}(\theta_1,\ldots,\theta_K)}

  • Terms \textcolor{blue}{s_j^{k}(x_{ij})}

Elements of the GAMLSS model

\begin{align*} y_{i}&\sim \textcolor{red}{\mathcal{D}\left(\theta_{i1},\ldots,\theta_{iK}\right)} \\ \textcolor{green}{g_k}(\theta_{ik})&=\beta_0^{k}+\textcolor{blue}{s_1^{k}(x_{i1})+\ldots+s_p^{k}( x_{ip})} \end{align*}

  • Response distribution \textcolor{red}{\mathcal{D}(\theta_1,\ldots,\theta_K)}

  • Terms \textcolor{blue}{s_j^{k}(x_{ij})}

  • Link functions \textcolor{green}{g_k}(\cdot)


We will also discuss

  • Estimation

  • Model diagnostics

  • Parameter interpretation

Response distributions

Continuous distributions


Types

  • real line \mathbb{R}: -\infty<y<\infty

  • positive real line \mathbb{R}_+: 0<y<\infty

  • bounded interval \mathbb{R}_{(0,1)}: 0<y<1

Real line \mathbb{R}

2-parameter distributions

  • Normal NO(\mu, \sigma)

  • Logistic LO(\mu, \sigma) (leptokurtic)

  • Gumbel GU(\mu, \sigma) (left skewed)

  • Reverse Gumbel RG(\mu, \sigma) (right skewed)

Real line \mathbb{R}

2-parameter distributions

Normal N(0, \sigma) and Logistic LO(0, \sigma)

Code
family_pdf(NO, mu=0, sigma=c(0.5, 1,1.5), from=-8, to=8, title="Normal") -> p1

family_pdf(LO, mu=0, sigma=c(0.5, 1,1.5), from=-8, to=8, title="Logistic") -> p2

gridExtra::grid.arrange(p1, p2, nrow=1)

Real line \mathbb{R}

2-parameter distributions

Gumbel GU(0, \sigma) and reverse Gumbel RG(0, \sigma)

Code
family_pdf(GU, mu=0, sigma=c(0.5, 1,1.5), from=-5, to=3, title="Gumbel") -> p1

family_pdf(RG, mu=0, sigma=c(0.5, 1,1.5), from=-3, to=5, title="Reverse Gumbel") -> p2

gridExtra::grid.arrange(p1, p2, nrow=1)

Real line \mathbb{R}

3-parameter distributions

  • exponential Gaussian: exGAUS(\mu, \sigma, \nu) for modelling right skew data,

  • normal family: NOF(\mu, \sigma, \nu) for modelling mean and variance relationships following the power law;

  • power exponential: PE(\mu, \sigma, \nu) and PE2(\mu, \sigma, \nu) for modelling lepto- and platykurtotic data;

  • t family: TF(\mu, \sigma, \nu) and TF2(\mu, \sigma, \nu) for modelling leptokurtotic data;

  • skew normal: SN1(\mu, \sigma, \nu) and SN2(\mu, \sigma, \nu) for modelling skewness

Real line \mathbb{R}

3-parameter distributions

Skew Normal type 1 SN1(0, 1, \nu)

Code
x <- seq(-4,4, length=101)
f1 <- dSN1(x, mu=0, sigma=1, nu=-100)
f2 <- dSN1(x, mu=0, sigma=1, nu=-5)
f3 <- dSN1(x, mu=0, sigma=1, nu=0)
da1 <-  data.frame(f=f1, x=x, fam=rep("nu = -100", 101))
da2 <-  data.frame(f=f2, x=x, fam=rep("nu = -5",101)) 
da3 <-  data.frame(f=f3, x=x, fam=rep("nu = 0",101)) 
da <- rbind(da1, da2, da3)
da$fam <- factor(da$fam, levels=c("nu = -100", "nu = -5",  "nu = 0"))
p9 <- ggplot(data=da, aes(x = x, y=f, group=fam)) +
      ggplot2::geom_line(lwd=1, aes(colour = fam)) +
      xlab("y") + ylab("f(y)")



f1 <- dSN1(x, mu=0, sigma=1, nu=100)
f2 <- dSN1(x, mu=0, sigma=1, nu=5)
f3 <- dSN1(x, mu=0, sigma=1, nu=0)
da1 <-  data.frame(f=f1, x=x, fam=rep("nu = 100", 101))
da2 <-  data.frame(f=f2, x=x, fam=rep("nu = 5",101)) 
da3 <-  data.frame(f=f3, x=x, fam=rep("nu = 0",101)) 
da <- rbind(da1, da2, da3)
da$fam <- factor(da$fam, levels=c("nu = 100", "nu = 5",  "nu = 0"))
p10 <- ggplot(data=da, aes(x = x, y=f, group=fam)) +
      ggplot2::geom_line(lwd=1, aes(colour = fam)) +
      xlab("y") + ylab("f(y)")

gridExtra::grid.arrange(p9, p10, nrow=1)

Real line \mathbb{R}

3-parameter distributions

  • Power Exponential PE(0,1,\nu)
Code
x <- seq(-4,4, length=101)
f1 <- dPE(x, mu=0, sigma=1, nu=1)
f2 <- dPE(x, mu=0, sigma=1, nu=2)
f3 <- dPE(x, mu=0, sigma=1, nu=10)
f4 <- dPE(x, mu=0, sigma=1, nu=100)
da1 <-  data.frame(f=f1, x=x, fam=rep("nu=1", 101))
da2 <-  data.frame(f=f2, x=x, fam=rep("nu=2",101)) 
da3 <-  data.frame(f=f3, x=x, fam=rep("nu=10",101)) 
da4 <-  data.frame(f=f4, x=x, fam=rep("nu=100",101)) 
da <- rbind(da1, da2, da3, da4)
da$fam <- factor(da$fam,
                 levels = c("nu=1","nu=2","nu=10","nu=100"))
p9 <- ggplot(data=da, aes(x = x, y=f, group=fam)) +
      ggplot2::geom_line(lwd=1, aes(colour = fam))+
      xlab("y") + ylab("f(y)")
p9

Real line \mathbb{R}

3-parameter distributions

t family TF(0,1,\nu)

Code
x <- seq(-5,5, length=101)
f1 <- dTF(x, mu=0, sigma=1, nu=100)
f2 <- dTF(x, mu=0, sigma=1, nu=2)
f3 <- dTF(x, mu=0, sigma=1, nu=1)
da1 <-  data.frame(f=f1, x=x, fam=rep("nu=100", 101))
da2 <-  data.frame(f=f2, x=x, fam=rep("nu=2",101)) 
da3 <-  data.frame(f=f3, x=x, fam=rep("nu=1",101)) 
da <- rbind(da1, da2, da3)
da$fam <- factor(da$fam,
                 levels = c("nu=1","nu=2","nu=100"))
p9 <- ggplot(data=da, aes(x = x, y=f, group=fam)) +
      ggplot2::geom_line(lwd=1, aes(colour = fam)) +
      xlab("y") + ylab("f(y)")
p9

Real line \mathbb{R}

4-parameter distributions

  • exponential generalised beta type 2,
    EGB2(\mu,\sigma,\nu,\tau) skewness and leptokurtosis;

  • generalised t, GT(\mu,\sigma, \nu, \tau) kurtosis;

  • Johnson's SU, JSU(\mu,\sigma, \nu, \tau) and \texttt{JSUo}(\mu,\sigma, \nu, \tau) skewness and leptokurtosis;

  • normal-exponential-t \texttt{NET}(\mu,\sigma, \nu, \tau), robustly location and scale

  • skew exponential power \texttt{SEP1}(\mu,\sigma, \nu, \tau), \texttt{SEP2}(\mu,\sigma, \nu, \tau), \texttt{SEP3}(\mu,\sigma, \nu, \tau) and \texttt{SEP4}(\mu,\sigma, \nu, \tau) skewness and lepto-platy;

  • sinh-arcsinh, \texttt{SHASH}(\mu,\sigma, \nu, \tau), \texttt{SHASHo}(\mu,\sigma, \nu, \tau) and \texttt{SHASHo2}(\mu,\sigma, \nu, \tau) skewness and lepto-platy;

  • skew t, \texttt{ST1}(\mu,\sigma, \nu, \tau), \texttt{ST2}(\mu,\sigma, \nu, \tau), \texttt{ST3}(\mu,\sigma, \nu, \tau), \texttt{ST4}(\mu,\sigma, \nu, \tau), \texttt{ST5}(\mu,\sigma, \nu, \tau) and \texttt{SST}(\mu,\sigma, \nu, \tau) skewness and leptokurtosis.

Real line \mathbb{R}

4-parameter distributions

SEP1(\mu=0, \sigma=1, \nu, \tau)

Code
p1 <- family_pdf(SEP1, mu=0, sigma=1, nu=c(0,1,2), tau=1, from=-4, to=4, title="SEP1(0, 1, nu = 0, 1, 2, tau = 1)")

p2 <- family_pdf(SEP1, mu=0, sigma=1, nu=c(0,1,2), tau=2, from=-4, to=4, title="SEP1(0, 1, nu = 0, 1, 2, tau = 2)")

p3 <- family_pdf(SEP1, mu=0, sigma=1, nu=c(0,1,2), tau=5, from=-2.5, to=2.5, title="SEP1(0, 1, nu = 0, 1, 2, tau = 5)")

p4 <- family_pdf(SEP1, mu=0, sigma=1, nu=c(0,-1,-2), tau=100, from=-2, to=2, title="SEP1(0, 1, nu = 0, 1, 2, tau = 100)")

gridExtra::grid.arrange(p1,p2,p3,p4, nrow=2)

Positive real line \mathbb{R}_+

1- and 2-parameter distributions

  • exponential, \texttt{EXP}(\mu,\sigma)

  • gamma \texttt{GA}(\mu,\sigma) member of the exponential family;

  • inverse gamma \texttt{IGAMMA}(\mu,\sigma);

  • inverse Gaussian, \texttt{IG}(\mu,\sigma) a member of the exponential family;

  • log-normal \texttt{LOGNO}(\mu,\sigma) and \texttt{LOGNO2}(\mu,\sigma).

  • Pareto \texttt{PARETO}(\mu,\sigma), \texttt{PARETO2o}(\mu,\sigma) and \texttt{GP}(\mu,\sigma) for heavy tail;

  • Weibull \texttt{WEI}(\mu,\sigma), \texttt{WEI2}(\mu,\sigma) and \texttt{WEI3}(\mu,\sigma) used in survival analysis.

Positive real line \mathbb{R}_+

2-parameter distributions

Weibull(\mu=1, \sigma)

Code
x <- seq(0.001,5, length=101)
f1 <- dWEI3(x, mu=1, sigma=0.5)
f2 <- dWEI3(x, mu=1, sigma=1)
f3 <- dWEI3(x, mu=1, sigma=6)
da1 <-  data.frame(f=f1, x=x, fam=rep("sigma = 0.5", 101))
da2 <-  data.frame(f=f2, x=x, fam=rep("sigma = 1",101)) 
da3 <-  data.frame(f=f3, x=x, fam=rep("sigma = 6",101)) 
da <- rbind(da1, da2, da3)
da$fam <- factor(da$fam,
                 levels = c("sigma = 0.5","sigma = 1","sigma = 6"))
ggplot(data=da, aes(x = x, y=f, group=fam)) +
      ggplot2::geom_line(lwd=0.7, aes(colour = fam)) +
      ylim(c(0, 3)) +
      xlab("y") + ylab("f(y)") +
      theme_bw() +
      theme(legend.title=element_blank(),
            axis.text.x=element_text(size=16),
        axis.text.y=element_text(size=16),
        axis.title.x = element_text(size=16),
        axis.title.y = element_text(size=16),
        legend.text = element_text(size=16))

Weibull(\mu=2, \sigma)

Code
x <- seq(0.001,5, length=101)
f1 <- dWEI3(x, mu=2, sigma=0.5)
f2 <- dWEI3(x, mu=2, sigma=1)
f3 <- dWEI3(x, mu=2, sigma=6)
da1 <-  data.frame(f=f1, x=x, fam=rep("sigma = 0.5", 101))
da2 <-  data.frame(f=f2, x=x, fam=rep("sigma = 1",101)) 
da3 <-  data.frame(f=f3, x=x, fam=rep("sigma = 6",101)) 
da <- rbind(da1, da2, da3)
da$fam <- factor(da$fam,
                 levels = c("sigma = 0.5","sigma = 1","sigma = 6"))
ggplot(data=da, aes(x = x, y=f, group=fam)) +
      ggplot2::geom_line(lwd=0.7, aes(colour = fam)) +
      ylim(c(0, 3)) +
      xlab("y") + ylab("f(y)") +
      theme_bw() +
      theme(legend.title=element_blank(),
            axis.text.x=element_text(size=16),
        axis.text.y=element_text(size=16),
        axis.title.x = element_text(size=16),
        axis.title.y = element_text(size=16),
        legend.text = element_text(size=16))

Positive real line \mathbb{R}_+

3-parameter distributions

  • Box-Cox Cole and Green, \texttt{BCCG}(\mu, \sigma, \nu) known also as the LMS method in centile estimation.

  • generalised gamma, \texttt{GG}(\mu, \sigma, \nu)

  • generalised inverse Gaussian, \texttt{GIG}(\mu, \sigma, \nu)

  • log-normal family, \texttt{LNO}(\mu, \sigma, \nu) based on the standard Box-Cox transformation

Positive real line \mathbb{R}_+

3-parameter distributions

BCCG(\mu=1, \sigma=0.4, \nu)

Code
x <- seq(0.001,3, length=501)
f1 <- dBCCG(x, mu=1, sigma=0.1, nu=-1)
f2 <- dBCCG(x, mu=1, sigma=0.3, nu=-1)
f3 <- dBCCG(x, mu=1, sigma=0.5, nu=-1)
da1 <-  data.frame(f=f1, x=x, fam=rep("nu=100", 501))
da2 <-  data.frame(f=f2, x=x, fam=rep("nu=2",501)) 
da3 <-  data.frame(f=f3, x=x, fam=rep("nu=1",501)) 
da <- rbind(da1, da2, da3)
da$fam <- factor(da$fam,
                 levels = c("nu=1","nu=2","nu=100"))
ggplot(data=da, aes(x = x, y=f, group=fam)) +
      ggplot2::geom_line(lwd=0.5, aes(colour = fam)) +
      xlab("y") + ylab("f(y)") +
      theme_bw() +
      theme(legend.title=element_blank(),
            axis.text.x=element_text(size=12),
        axis.text.y=element_text(size=12),
        axis.title.x = element_text(size=12),
        axis.title.y = element_text(size=12),
        legend.text = element_text(size=12))

Positive real line \mathbb{R}_+

4-parameter distributions

  • Box-Cox power exponential, \texttt{BCPE}(\mu,\sigma,\nu,\tau) and \texttt{BCPEo}(\mu,\sigma,\nu,\tau) skewness and platy- and leptokurtosis;

  • Box-Cox t, \texttt{BCT}(\mu,\sigma,\nu,\tau) and \texttt{BCTo}(\mu,\sigma,\nu,\tau) skewness and leptokurtosis;

  • generalised beta type 2, \texttt{GB2}(\mu,\sigma,\nu,\tau) skewness and platy- and leptokurtosis

Positive real line \mathbb{R}_+

4-parameter distributions

BCT (\mu=1, \sigma=0.2, \nu, \tau=10)

Code
x <- seq(0.001,3, length=501)
f1 <- dBCT(x, mu=1, sigma=0.2, nu=-1, tau=10)
f2 <- dBCT(x, mu=1, sigma=0.2, nu=1, tau=10)
f3 <- dBCT(x, mu=1, sigma=0.2, nu=2, tau=10)
da1 <-  data.frame(f=f1, x=x, fam=rep("nu = -1", 501))
da2 <-  data.frame(f=f2, x=x, fam=rep("nu = 1",501)) 
da3 <-  data.frame(f=f3, x=x, fam=rep("nu = 2",501)) 
da <- rbind(da1, da2, da3)
da$fam <- factor(da$fam,
                 levels = c("nu = -1","nu = 1","nu = 2"))
ggplot(data=da, aes(x = x, y=f, group=fam)) +
      ggplot2::geom_line(lwd=0.5, aes(colour = fam)) +
      xlab("y") + ylab("f(y)") +
      theme_bw() +
      theme(legend.title=element_blank(),
            axis.text.x=element_text(size=16),
        axis.text.y=element_text(size=16),
        axis.title.x = element_text(size=16),
        axis.title.y = element_text(size=16),
        legend.text = element_text(size=16))

BCPE (\mu=1, \sigma=0.2, \nu=2, \tau)

Code
x <- seq(0.001,2.5, length=501)
f1 <- dBCPE(x, mu=1, sigma=0.2, nu=2, tau=1)
f2 <- dBCPE(x, mu=1, sigma=0.2, nu=2, tau=2)
f3 <- dBCPE(x, mu=1, sigma=0.2, nu=2, tau=10)
da1 <-  data.frame(f=f1, x=x, fam=rep("tau = 1", 501))
da2 <-  data.frame(f=f2, x=x, fam=rep("tau = 2",501)) 
da3 <-  data.frame(f=f3, x=x, fam=rep("tau = 10",501)) 
da <- rbind(da1, da2, da3)
da$fam <- factor(da$fam,
                 levels = c("tau = 1","tau = 2","tau = 10"))
ggplot(data=da, aes(x = x, y=f, group=fam)) +
      ggplot2::geom_line(lwd=0.5, aes(colour = fam)) +
      xlab("y") + ylab("f(y)") +
      theme_bw() +
      theme(legend.title=element_blank(),
            axis.text.x=element_text(size=16),
        axis.text.y=element_text(size=16),
        axis.title.x = element_text(size=16),
        axis.title.y = element_text(size=16),
        legend.text = element_text(size=16))

Bounded real line \mathbb{R}_{(0,1)}


Distribution gamlss family no par.
beta BE 2
beta original BEo 2
generalized beta type 1 GB1 4
logit normal LOGITNO 2
simplex SIMPLEX 2

Bounded real line \mathbb{R}_{(0,1)}

Beta(\mu=0.5,\sigma)

Code
x <- seq(0.001,0.999, length=101)
f1 <- dBE(x, mu=0.5, sigma=0.2)
f2 <- dBE(x, mu=0.5, sigma=0.5)
f3 <- dBE(x, mu=0.5, sigma=0.7)
da1 <-  data.frame(f=f1, x=x, fam=rep("sigma = 0.2", 101))
da2 <-  data.frame(f=f2, x=x, fam=rep("sigma = 0.5",101)) 
da3 <-  data.frame(f=f3, x=x, fam=rep("sigma = 0.7",101)) 
da <- rbind(da1, da2, da3)
da$fam <- factor(da$fam,
                 levels = c("sigma = 0.2","sigma = 0.5","sigma = 0.7"))
ggplot(data=da, aes(x = x, y=f, group=fam)) +
      ggplot2::geom_line(lwd=0.5, aes(colour = fam)) +
      xlab("y") + ylab("f(y)") +
      theme_bw() +
      theme(legend.title=element_blank(),
            axis.text.x=element_text(size=16),
        axis.text.y=element_text(size=16),
        axis.title.x = element_text(size=16),
        axis.title.y = element_text(size=16),
        legend.text = element_text(size=16))

Simplex(\mu=0.5,\sigma)

Code
x <- seq(0.001,0.999, length=101)
f1 <- dSIMPLEX(x, mu=0.5, sigma=0.5)
f2 <- dSIMPLEX(x, mu=0.5, sigma=1)
f3 <- dSIMPLEX(x, mu=0.5, sigma=2)
da1 <-  data.frame(f=f1, x=x, fam=rep("sigma = 0.5", 101))
da2 <-  data.frame(f=f2, x=x, fam=rep("sigma = 1",101)) 
da3 <-  data.frame(f=f3, x=x, fam=rep("sigma = 2",101)) 
da <- rbind(da1, da2, da3)
da$fam <- factor(da$fam,
                 levels = c("sigma = 0.5","sigma = 1","sigma = 2"))
ggplot(data=da, aes(x = x, y=f, group=fam)) +
      ggplot2::geom_line(lwd=0.5, aes(colour = fam)) +
      xlab("y") + ylab("f(y)") +
      theme_bw() +
      theme(legend.title=element_blank(),
            axis.text.x=element_text(size=16),
        axis.text.y=element_text(size=16),
        axis.title.x = element_text(size=16),
        axis.title.y = element_text(size=16),
        legend.text = element_text(size=16))

Table of continuous distributions

Mixed distributions

  • mixed discrete-continuous

  • continuous with probability mass at discrete point(s)

  • aka semi-continuous


Zero-adjusted distributions

  • On \mathbb{R}_+ with a probability spike at zero

    • daily rainfall

    • amount of insurance claim in a year

    • annual medical expenditure

f(y \mid \boldsymbol{\theta}, \pi)= \begin{cases}\pi & \text { if } y=0 \\ (1-\pi) f_c(y \mid \boldsymbol{\theta}) & \text { if } y>0\end{cases}

Zero- and/or one-inflated distributions

  • On \mathbb{R}_{(0,1)} with a probability spike(s) at zero and/or one


Example: pain

  • A patient-reported measure of pain is the Visual Analogue Scale (VAS)

  • Subjects are given a linear scale and asked to put a mark where they perceive themselves

  • Responses are measured and scaled to [0, 1]

  • While 1 is not frequently recorded, 0 is common

f\left(y \mid \boldsymbol{\theta}, \pi_0, \pi_1\right)= \begin{cases}\pi_0 & \text { if } y=0 \\ \left(1-\pi_0-\pi_1\right) f_c(y \mid \boldsymbol{\theta}) & \text { if } 0<y<1 \\ \pi_1 & \text { if } y=1\end{cases}

Mixed distributions in GAMLSS


Distribution GAMLSS name Parameters
zero-adjusted inverse Gaussian ZAIG \mu, \sigma, \nu
zero-adjusted Gamma ZAGA \mu, \sigma, \nu
zero-inflated beta BEINF0, BEZI \mu, \sigma, \nu
one-inflated beta BEINF1, BEOI \mu, \sigma, \nu
zero- and one-inflated beta BEINF \mu, \sigma, \nu, \tau


  • Zero-adjusted and zero-and/or one-inflated versions of other continuous distributions can be user-defined

  • E. g. Chapter 9 of M. D. Stasinopoulos et al. (2024) implements a zero-and-one-inflated Simplex distribution

Discrete distributions: unbounded

  • \mathcal{S}=\{0,1,2,\ldots\}

  • Response is generally a count

  • Starting point: Poisson distribution

\begin{align*} y&\sim\text{PO}(\mu)\\ \mathbb{E}(y)&=\text{Var}(y)=\mu \end{align*}

  • Problems

    • Overdispersion/heavy tail \quad \text{Var}(y)>\mathbb{E}(y)

    • Excess/lack of zeroes

Discrete distributions



ZA = zero adjusted, ZI = zero inflated

Discrete distributions

Zero inflation

\begin{equation*} \begin{split} \mathbb{P}(Y=y)= \begin{cases}\textcolor{red}{\pi}+(1-\pi) \mathbb{P}\left(Y_1=0\right) & \text { if } y=0 \\ (1-\pi) \mathbb{P}\left(Y_1=y\right) & \text { if } y=1,2,3, \ldots \end{cases} \end{split} \end{equation*}

  • \mathbb{P}(Y_1) is the probability function of the response variable without zero inflation

  • \mathbb{P}(Y) is that of the zero-inflated response distribution.

  • Assumption: a proportion of the population (\pi) generates zero with certainty, and the remaining (1-\pi) generates counts according to the assumed response distribution.


Examples

  • number of falls, number of blood transfusions

  • species abundance counts

  • number of wildfires

  • etc …

Discrete distributions

Zero adjustment (aka hurdle models)

\begin{equation*} \begin{split} \mathbb{P}(Y=y)= \begin{cases}\textcolor{red}{\pi} & \text { if } y=0 \\ (1-\pi) \mathbb{P}\left(Y_2=y\right) & \text { if } y=1,2,3, \ldots \end{cases} \end{split} \end{equation*}

  • Y_2 is the zero-truncated version of the response variable.
  • In ZA (hurdle) model, the zero probability can be inflated or deflated with respect to the original zero probability

  • In ZI model, only inflation is possible.

  • If you need a ZI or ZA version of a distribution that isn’t supplied by gamlss (e.g. ZIWARING), this can be user-defined.

Discrete distributions

Bounded counts

  • \mathcal{S}=\{0, 1, \ldots, n\}

  • These generally are binomial-type responses

  • See Discrete distributions table for more information

Discrete distributions table

References

Rigby, R.A., Stasinopoulos, D.M., Heller, G.Z., & De Bastiani, F. (2019). Distributions for modeling location, scale, and shape: Using GAMLSS in R. Boca Raton: Chapman & Hall/CRC.
Stasinopoulos, D.M., Rigby, R.A., Heller, G.Z., Voudouris, V., & De Bastiani, F. (2017). Flexible regression and smoothing: Using GAMLSS in R. Boca Raton: Chapman & Hall/CRC.
Stasinopoulos, M.D., Kneib, T., Klein, N., Mayr, A., & Heller, G.Z. (2024). Generalized additive models for location, scale and shape: A distributional regression approach, with applications (Vol. 56). Cambridge University Press.